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Many biological characteristics of evolutionary interest are not scalar vari- 
ables but continuous functions. Given a dataset of function-valued traits 
generated by evolution, we develop a practical, statistical approach to 
infer ancestral function-valued traits, and estimate the generative evolution- 
ary process. We do this by combining dimension reduction and phylogenetic 
Gaussian process regression, a non-parametric procedure that explicitly 
accounts for known phylogenetic relationships. We test the performance of 
methods on simulated, function-valued data generated from a stochastic 
evolutionary model. The methods are applied assuming that only the 
phylogeny, and the function-valued traits of taxa at its tips are known. 
Our method is robust and applicable to a wide range of function-valued 
data, and also offers a phylogenetically aware method for estimating the 
autocorrelation of function-valued traits. 
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1. Introduction 

The number, reliability and coverage of evolutionary trees are growing rapidly 
[1,2]. However, knowing organisms' evolutionary relationships through phylo- 
genetics is only one step in understanding the evolution of their characteristics 
[3]. Three issues are particularly challenging. The first is limited information: 
empirical information is typically available only for extant taxa, represented 
by tips of a phylogenetic tree, whereas evolutionary questions frequently con- 
cern unobserved ancestors deeper in the tree. The second is dependence: the 
available information for different organisms in a phylogeny is independent 
because a phylogeny describes a complex pattern of non-independence; 
observed variation is a mixture of this inherited variation and specific variation 
[4]. The third is high dimensionality: the emerging literature on function-valued 
traits [5-7] recognizes that many characteristics of living organisms are best 
represented as a continuous function rather than a single factor or a small 
number of correlated factors. Such characteristics include growth or mortality 
curves [8], reaction norms [9] and distributions [10], where the increasing 
ease of genome sequencing has greatly expanded the range of species in 
which distributions of gene [11] or predicted protein [12] properties are avail- 
able. Therefore, a function-valued trait is defined as a phenotypic trait that 
can be represented by a continuous mathematical function [9]. 

Previous work [13] proposed an evolutionary model for function-valued 
data d related by a phylogeny T. The data are regarded as observations of a 
phylogenetic Gaussian process (PGP) at the tips of T. That work shows that a 
PGP can be expressed as a stochastic linear operator X on a fixed set </> of 
basis functions (independent components of variation), so that 



(1.1) 



However, the study does not address the linear inverse problem of obtaining 
estimates 4> and X of 4> and X: our first contribution in this paper is to provide 
an approach to this problem in §2.2 via independent principal component 
analysis (IPCA; [14]). 
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We refer to X as the mixing matrix, and to the (z',/)th entry 
of X as the mixing coefficient of the zth basis function at the ;th 
taxon. It is these mixing coefficients that we model as evol- 
ving. For each fixed value of i, the Xy are correlated (owing 
to phylogeny) as ; varies over the taxa; the basis functions 
themselves do not evolve in our model. 

In §2.3, we address the problem of estimating the statistical 
structure of the mixing coefficients by performing phylogenetic 
Gaussian process regression (PGPR) on each of the rows of 
X separately. This corresponds to assuming independence 
between the rows (i.e. that the coefficients of the different 
basis functions evolve independently). It is commonly argued 
in the quantitative genetics literature [15] that evolutionary pro- 
cesses can be modelled as Ornstein-Uhlenbeck (OU) 
processes. Under these assumptions, the estimation of the for- 
ward operator reduces to the estimation of a small vector y of 
parameters [13]. In §2.1, we clarify the interpretation of these 
parameters in evolutionary contexts. The explicit PGPR pos- 
terior likelihood function is then used to obtain maximum- 
likelihood estimates (MLEs) for y. The estimation of y is 
known to be a challenging statistical problem [16]. We suggest 
an approach based on the principle of bagging [17] in §2.4. 

Our final contribution (§2.5) addresses the problem of 
estimating the function-valued traits of ancestral taxa. The 
earlier-mentioned PGPR step also returns a posterior distri- 
bution for the mixing coefficient of each basis function at 
each ancestral taxon in the phylogeny. At any particular 
ancestor, the estimated basis functions may be combined stat- 
istically, using the posterior distributions of their respective 
mixing coefficients, to provide a posterior distribution for 
the function-valued trait. Because the univariate posterior 
distributions are Gaussian, and the mixing is linear, the pos- 
terior for the function-valued trait has a closed form 
representation as a GP (equation (2.6)) that provides a 
major analytical and computational advantage for the 
approach. We can verify the methods proposed by using a 
PGP as a stochastic generative model. This simulates corre- 
lated function-valued traits across the taxa of T. Given only 
the phylogeny and the function-valued traits of taxa at its 
tips, our estimates for ^> and the ancestral functions are 
then compared with the simulation. 

Overall, our three methods (in §§2.2, 2.4, 2.5) appropri- 
ately combine developments in functional data analysis 
with the evolutionary dynamics of quantitative phenotypic 
traits, allowing non-parametric Bayesian inference from phy- 
logenetically correlated function-valued traits. An outline of 
the framework presented in this study can be found in 
figure 1. 



2. Methods and implementation 

2.1. Artificial evolution of function-valued traits 

We begin by generating a random phylogenetic tree T with 
128 tips, shown in figure 2. This fixes the experimental 
design for our simulation and inference, but further simu- 
lations given in the electronic supplementary material 
confirm that the statistical performance of our methods is 
consistent across a range of choices for T. Branch length dis- 
tributions are surprisingly consistent across organisms [18]; 
branch lengths were drawn from the empirical branch 
length distribution (see electronic supplementary material, 
section SI) extracted from TreeFam v. 8.0 [2]. 
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Figure 1. The three methods presented in this paper (ovals) and their 
interrelationships. 




Figure 2. The random phylogenetic tree used and examples of the function- 
valued traits shown at the tips (extant taxa) and the internal nodes (ancestral 
taxa). A subset of these is used in figure 5. 

Second, we chose a basis 4> in equation (1.1). We have no 
reason a priori to suppose that this basis is orthogonal and, in 
general, there is no reason for our inference procedure to be sen- 
sitive to the particular shape of the basis functions. The three 
simple non-orthogonal, unimodal functions shown in figure 3 
were therefore chosen as examples. For computational pur- 
poses, each basis function was stored numerically as a vector 
of length 1024, so that the basis matrix <$> was of size 3 x 1024 
and its zth row stored the zth basis function. 

Third, different mixing coefficients were generated by a 
phylogenetic OU process for each basis function and stored 
in the respective row of X. Our modelling assumption is 
that the mixing coefficients for distinct basis functions </>!, 
4> 2 , 4>3 are statistically independent of each other: in equation 
(1.1), this means that the rows of X are independent. It is 
therefore sufficient to describe the stochastic process generat- 
ing X if the zth row of X with i G {1, 2, 3}. We calculated the 
mixing matrix at the 128 tip taxa so X is of size 3 x 128. 
The 'true' ancestral values were established by generating 
phylogenetic OU processes over the whole phylogeny. The 
values of this process at tip taxa were stored in a row 




vector Xi (X z is a simulation of the tip taxa mixing coefficients X z 
excluding the non-phylogenetic variation), and its values at 
internal taxa were stored in a row vector W z for performance 
analyses in §2.5. To simulate the additional effect of non- 
phylogenetic variation (e.g. due to measurement error or 
environmental effects), independent (i.e. non-phylogenetic) 
variation was added to each entry of X z : 

X, = Xi + e u 

where e { is a 1 x 128 vector of independent Gaussian errors 
with mean 0 and standard deviation o^, and, finally, the 
matrix multiplication in equation (1.1) was performed to 
obtain the simulated data d. The 'extant' function-valued trait 
at tip taxon j is thus J^Li $i ( a vec tor of length 1024), whereas 
the ancestral function-valued trait at internal taxon g is 
X)Li Wigfa. The ancestral function-valued traits therefore exhi- 
bit only the phylogenetic part of simulated variation, whereas 
the extant function-valued traits exhibit both phylogenetic 
and non-phylogenetic variation. Of course, it is not possible 
to reconstruct non-phylogenetic variation using phylogenetic 
methods: we simulate non-phylogenetic variation only to 
demonstrate that it does not prevent the reconstruction of the 
phylogenetic part of variation for ancestral taxa in §§2.2-2.5. 

We now comment on the specific parameters chosen for 
the phylogenetic OU processes mentioned earlier. As in 
Hansen [19], we refer to the strength of selection parameter a 
and the random genetic drift a: we add superscripts to these 
parameters to distinguish between the three different OU 
processes. With this notation, the mixing coefficients for the 
row X{ have the following covariance function: 

Ki Y (t 1 ,t 2 ) = E[X lj X lg ] 

= (ai) 2 exp(-2^D T (t ; ,t g )) + (c/ n )% g , ^ 



Table 1. The fixed values used for the parameters in equation (2.1) to 
generate the mixing coefficients X S j. Each row constitutes a value of y 1 . 
6.17 and 2.06 correspond to 0.75 and 0.25 of the tree's £ mMI respectively. 
When / = 2, £' is not applicable, because there is no phylogenetic 
variation in the sample. 
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where oj = y (o i ) 2 /2a i , D T (t ; , t g ) denotes the phylogenetic or 
patristic distance (i.e. the distance in T) between the ;th and 
gth tip taxa, a n is defined as earlier, and 

^ _ f 1 , if tj = tg and tj is a tip taxon, 
\ 0, otherwise 

adds non-phylogenetic variation to extant taxa as discussed 
earlier, i.e. (f evaluates to 1 only for extant taxa, thus a n quan- 
tifies within-species genetic or environmental effects and 
measurement error in the zth mixing coefficient. We see 
from equation (2.1) that the proportion of variation in the 
row Xi attributable to the phylogeny is (oj) 2 / (oj) 2 + (c^) 2 • 

In the Gaussian process regression (GPR) literature in 
machine learning, l/2a is equivalent to £, the characteristic 
length-scale [20] of decay in the correlation function and 
in the following we work with the latter. For all of the OU 
processes, we used characteristic length-scales relative to 
8.22, the maximum patristic distance (£ max ) between two 
extant taxa for our simulated tree (figure 2). The values we 
used are given in table 1. In particular, oi = 0 when i = 2 



and it follows that the characteristic length-scale £ plays no 
role for this OU process, and equally we do not define the 
strength-of-selection parameter ol when i = 2. 

2.2. Dimensionality reduction and source separation for 
function-valued traits 

Given a dataset d of function-valued traits, we would like to 
find appropriate estimates X and <f> of the mixing matrix X 
and the basis set <fi, respectively. The first task is to identify 
a good linear subspace S of the space of all continuous func- 
tions by choosing basis functions appropriately. The purpose 
is to work, not with the function-valued data directly, but 
with their projections in S. We may say that the chosen sub- 
space S is good, if the projected data approximate the original 
data well, whereas the number of basis functions is not 
unnecessarily large so that S has the 'effective' dimension 
of the data. 

We then face a linear inverse problem: given the dataset d 
of function-valued traits, the task is to generate estimates 
X and 4> (equation (1.1)). This task is also known as source 
separation [21], which has a variety of implementations 
making different assumptions about the basis <fi and mixing 
coefficients X. One widely used approach is principal com- 
ponents analysis (PC A) [22], which returns orthogonal sets 
of basis functions to explain the greatest possible variation. 
PCA has been extended to take account of phylogenetic 
relationships [23], however, if a sample of functions is gener- 
ated by mixing non-orthogonal basis functions, the PCs of 
the sample (whether or not they account for phylogeny) will 
not equal the basis curves, due to the assumption of orthogon- 
ality (figure 3). In the independent component analysis (ICA), 
the alternative assumption is made that the rows X x - of X are 
statistically independent. This assumption fits more naturally 
with our modelling assumptions, because we assume that 
the rows X z are mutually independent [21]. ICA has proved 
fruitful in other biological applications [24] as has passing 
the results of PCA to ICA, which has been termed IPCA [14]. 

PCA is an appropriate tool for identifying the effective 
dimension of a high-dimensional dataset [25]. Therefore, to 
achieve both dimension reduction and source separation, we 
first applied PCA to the dataset d (the 128 function-valued 
traits at the tips of T) to determine the appropriate number of 
basis functions. The PCs were then passed to the CubICA imple- 
mentation of ICA [26]. CubICA returned a new set of basis 
functions (figure 3d) that were taken as the estimated basis 4> . 

2.3. Phylogenetic Gaussian process regression 

ICA also returns the estimated mixing coefficients at tip taxa, 
X. Our next step was to perform PGPR [13] separately 
on each row X,-, assuming knowledge of the phylogeny T, 
in order to obtain posterior distributions for all mixing 
coefficients throughout the tree T. 

GPR [20] is a flexible Bayesian technique in which prior 
distributions are placed on continuous functions. Its range 
of priors includes the Brownian motion and OU processes, 
which are by far the most commonly used models of charac- 
ter evolution [15,27]. Its implementation is particularly 
straightforward, because the posterior distributions are also 
GPs and have closed forms. We now give a brief exposition 
of GPR, using notation standard in the machine learning 
literature [20]. 



A GP may be specified by its mean surface and its covari- 
ance function K(y), where y is a vector of parameters. Because 
the components of y parametrize the prior distribution, they 
are referred to as /zyperparameters. The GP prior distribution 
is denoted 

/~jV(0,K(y)). 

If x* is a set of unobserved coordinates and x is a set of 
observed coordinates, then the posterior distribution of the 
vector fix*) given the observations f(x) is 

f(?)\f(x)~tf(A,B), (2.2) 

where 

A = K(x*,x,y)K(x,x,y)- 1 f(x), (2.3) 

and 

B = K{x\x\y) - K(*T,x, y)K(x,x, y^Kix^x, yf (2.4) 

and K(x*, x, y) denotes the |x*| x \x\ matrix of the covariance 
function K evaluated at all pairs x* G X* , Xj G X. Equations 
(2.3) and (2.4) convey that the posterior mean estimate will 
be a linear combination of the given data and that the pos- 
terior variance will be equal to the prior variance minus the 
amount that can be explained by the data. Additionally, the 
log-likelihood of the sample fix) is 

logp(/(x)|y) = _I/ W T K(x,x,y)-7(x) 

- ilog(det(K(x,x, y))) - ^log277. (2.5) 

It can be seen from equation (2.5) that the MLE is subject both 
to the fit it delivers (the first term) and the model complexity 
(the second term). Thus, GPR is non-parametric in the sense 
that no assumption is made about the structure of the model: 
the more data gathered, the longer the vector fix), and the 
more intricate the posterior model for fix*). 

PGPR extends the applicability of GPR to evolved func- 
tion-valued traits. A PGP is a GP indexed by a phylogeny 
T, where the function-valued traits at each pair of taxa are 
conditionally independent given the function-valued traits 
of their common ancestors. When the evolutionary process 
has the same covariance function along any branch of T 
beginning at its root (called the marginal covariance function), 
these assumptions are sufficient to uniquely specify the 
covariance function of the PGP, K T . As we assume that T is 
known in our inverse problem, the only remaining modelling 
choice is therefore the marginal covariance function. As can 
be seen from equation (2.1), K is a function of patristic dis- 
tances on the tree rather than Euclidean distances as 
standard in spatial GPR. 

In comparative studies, where one has observations at 
the tips of T, the covariance function K T may be used to con- 
struct a GP prior for the function-valued traits, allowing 
functional regression. In the model that we use, this is equiv- 
alent to specifying a Gaussian prior distribution for the 
mixing coefficients Yy and X,y. This may be carried out by 
regarding the row vectors Y z and X x - as observations of a uni- 
variate PGP. As noted in Jones & Moriarty [13], if we assume 
that the evolutionary process is Markovian and stationary, 
then the modelling choice vanishes, and the marginal covari- 
ance function is specified uniquely: it is the stationary OU 
covariance function. If we also add explicit modelling of 
non-phylogenetically related variation at the tip taxa, the 




to 
b 



-1 0 1 
relative error 



-1 0 1 
relative error 



1 0 1 
relative error 



Figure 4. Kernel density estimates of the relative errors in 1024 runs of the y estimation procedure, each time for a different tree, a different set of mixing 
coefficients and a different set of parameters in y; no components of y are assumed to be known beforehand. Estimation results are commented on in §3. 
The median values shown by the dotted line are (-0.073, -0.131 and 0.001), respectively. 



CD 

LaJ 

ho 



univariate prior covariance function has the unique func- 
tional form presented in equation (2.1). We do not assume 
knowledge of the parameters of equation (2.1), however, 
their estimation is the subject of §2.4. 

2.4. Hyperparameter estimation 

Because the posterior distributions returned by PGPR depend 
on the hyperparameter vector y, we must estimate y in 
order to reconstruct ancestral function-valued traits, and the 
estimation procedure should correct for the dependence 
owing to phylogeny. MLE of the phylogenetic variation, 
non-phylogenetic variation and characteristic length-scale 
hyperparameters oj, & n and l\ respectively, may be attempted 
numerically using the explicit prior likelihood function 
(equation (2.5)). Because estimating cr^ and £ alone is challen- 
ging [16] (although the estimation improves significantly with 
increased sample size), and we have further increased the chal- 
lenge by introducing non-phylogenetic variation, we propose 
an improved estimation procedure using the machine learning 
technique bagging [17], which a member of the boosting frame- 
work [22]. We show that these estimates may be further 
improved if one knows the value of the ratio (oy) 2 /(<j w ) 2 , 
which is closely related to Pagel's A [28]. 

Bagging (bootstrap aggregating) seeks to reduce the var- 
iance of an estimator by generating multiple estimates and 
averaging. It is simple to implement given an existing esti- 
mation procedure: one adds a loop front end that selects a 
bootstrap sample and sends it to the estimation procedure 
and a back end that aggregates the resulting estimates [17]. 
We generated 100 (sub) trees of 100 taxa by sampling without 
replacement our original 128 taxa tree, obtained the MLE for 
y on each subtree, and averaged these estimates to obtain the 
aggregated estimate y. Our results are shown in table 2: for 
i = 1 and i = 3, given our moderate sample size (128 taxa), 
the accuracy of these results is at least in line with the state 
of the art [16] despite the additional challenge posed by 
non-phylogenetic variation. For i = 2, where phylogenetic 
variation is absent from the generative model (cr^ = 0), our 
estimation procedure indicates its absence by returning esti- 
mates for £ whose magnitude is unrealistically small for the 
examined tree (less than the first percentile of the tree's 



Table 2. The bagging estimates for the hyperparameters in equation (2.1) 
(standard deviations of bagging estimates in parentheses). Each row 
corresponds to a given estimate of the vector y 1 . These estimates provide 
the maximum-likelihood value for equation (2.5) and are comparable with 
the original ones from table 1. 
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patristic distances). Commenting further on this matter, 
exceptionally small characteristic length-scales relative to the 
tree patristic distances, as seen here, practically suggest 
taxa-specific phylogenetic variation, i.e. non-phylogenetic 
variation. This holds also in its reverse: exceptionally large 
characteristic length-scales suggest a stable, non-decaying 
variation across the examined taxa that is indifferent to 
their patristic distances, again suggesting the absence of 
phylogenetic variance among the nodes. 

To assess the robustness of this hyperparameter esti- 
mation method, we performed 1024 simulations, randomly 
regenerating the tree and parameter vector y each time (see 
electronic supplementary material, section S2). The accuracy 
of these estimates is shown in figure 4. Improved results 
when the ratio (oy) 2 /(cr w ) 2 is known a priori (e.g. through 
knowledge of Pagel's A) are also given in the electronic 
supplementary material, sections S2 and S3. Our ultimate 
aim is ancestor reconstruction rather than hyperparameter 
estimation per se, and this is the subject of §2.5. 

2.5. Ancestor reconstruction 

Having generated function-valued data (§2.1), extracted 
mixing coefficients X (§2.2) and performed hyperparameter 
estimation (§2.4), we may now perform PGPR (§2.3) on 
each row X z , to obtain the univariate Gaussian posterior dis- 
tribution for the mixing coefficient Wit* at any internal taxon 
t*. As discussed in §2.3, the GP prior distribution has covari- 
ance function (equation (2.1)). We have assessed the accuracy 
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Figure 5. Posterior distributions at three points in the phylogeny using the estimated 0 and y. The prediction made by the regression analysis is shown via the i § 

posterior mean (red line), the component of posterior variance due to phylogenetic variation (2 s.d., dark grey band) and non-phylogenetic variation (2 s.d., light j g 

grey band). The black line shows the simulated data enabling visual validation of the ancestral predictions, (c), the black line is the training data at a tip taxon the ! g 

red line and dark grey band represent the posterior distribution of its phylogenetic component, whereas the light grey band represents the estimated magnitude of j S 

non-phylogenetic variation. The root and internal taxon here are the same as those indicated in figures 2 and 3, and the tip is the second from bottom on the same j 8 
figure, (a) root estimate; (b) node estimate; and (c) tip estimate. (Online version in colour.) 



of our bagging estimate y in §2.4 and we now substitute 
y l into equation (2.1). Taking a simple and direct approach, 
our estimate (j) obtained in §2.2 may then be substituted 
into equation (1.1) to obtain the function-valued posterior 
distribution f t * for the function-valued trait at taxon t*. 
Because our estimated basis functions are stored numerically 
as vectors of length 1024, this gives the same discretion for 
the ancestral traits. 

Conditioning on our estimated mixing coefficients % for 
the tip taxa, the posterior distribution of W zt * is 

W ;i .~AA(A f ,B f ), 

where the vector A\ and matrix B\ are obtained from 
equations (2.3) and (2.4), taking x = %, x* = Wn* and y\ 
respectively, for our observation coordinates, estimation coor- 
dinates and hyperparameter vector. Because our prior 
assumption is that the rows of X are statistically independent 
of each other, it follows from equation (1.1) that 




The marginal distributions of this representation (mean 
and standard deviation) are shown in figure 5. 

Figure 5 compares the function-valued estimates ft* to the 
simulated function-valued traits at the root (figure 5a), 
an internal node (figure 5b), and at a tip (figure 5c). In 
figure 5a, b, the simulated function-valued data are shown 
in black, and can be seen typically to lie within two posterior 
standard deviations. In figure 5c, the black line is the 
observed function-valued trait at that tip: the red line and 
dark grey band represent the posterior distribution of its phy- 
logenetic component, and the light grey band represents the 
estimated magnitude of the additional non-phylogenetic 
variation. Uncertainty over the phylogenetic part of variation 
(dark grey band) decreases from root to tip, as all obser- 
vations are at the extant tip taxa. We note that the posterior 
distributions, even at the root, put clear statistical constraints 
on the phylogenetic part of ancestral function-valued data: in 



this (admittedly simulated and highly controlled) setting, we 
can reason effectively about ancestral function-valued traits. 

3. Discussion 

In §2.1, we have appealed to equation (1.1) in the setting of 
mathematical inverse problems where, given data d, the chal- 
lenge is to infer a forward operator G and model 4> such that 

d = G((/}), (3.1) 

and such problems are typically under-determined and 
require additional modelling assumptions [29]. Given a phy- 
logeny T and function-valued data d at its tips, we wish to 
infer the forward operator G T and model 4> such that 

d = G T (4>). (3.2) 

When the data d are a small number of correlated factors 
per tip taxon, a variety of statistical approaches are available 
[30,31]. When the data are functions, the PGPs [13,32] have 
been proposed as the forward operator and this is the 
approach we have taken in this work. 

Our dimensionality-reduction methodology in §2.2 
can be easily varied or extended. For example, any suitable 
implementation of PCA may be used to perform the initial 
dimension reduction step: in particular, if the data have an 
irregular design (as happens frequently with function- 
valued data), the method of Yao et al. [33] may be applied 
to account for this; the ICA step then proceeds unchanged. 
We also note that while we find the CubICA implementation 
of ICA to be the most successful in our signal separation task, 
other implementations such as FastICA [21] or JADE [34] can 
also be used. In general, ICA gives rows % of the estimated 
mixing matrix that are maximally independent under a par- 
ticular measure of independence involving, for example, 
higher sample moments or mutual information, in order to 
approximate the solution of the inverse problem in equation 
(1.1) under our assumption of independence between the 
rows of X. PCA and ICA have different purposes (respect- 
ively, orthogonal decomposition of variation and separation 



of independently mixed signals) and we use them sequen- 
tially in IPCA. IPCA is non-parametric and, in particular, 
both distributionally and phylogenetically agnostic. This 
means that unlike PCA, IPCA is robust to non-Gaussianity 
in the data and, unlike phylogenetically corrected PCA, 
IPCA is robust to mis-specification of the phylogeny and to 
mixed phylogenetic and non-phylogenetic variation in the 
data: any of these can be features of biological data. 

It can be seen in figure 4 that the estimation of £ is more 
challenging than the estimation of a n or oy, having greater 
bias and variance. This corresponds to the documented diffi- 
culty of estimating the parameter a in the OU model, 
particularly for smaller sample sizes. Our work on hyper- 
parameter estimation in §2.4 mitigates these difficulties due 
to small sample size [16,35] by using bagging in order to 
bootstrap our sample. Somewhat unintuively, bagging 
'works' exactly because the subsample y estimates are vari- 
able and thus we avoid overfitted final estimates (see 
electronic supplementary material, section S2). Conceptually, 
our work on hyperparameter estimation, when taken 
together with §2.2, relates to the character process models 
of Pletcher & Geyer [8] and orthogonal polynomial methods 
of Kirkpatrick & Heckman [5], which give estimates for 
the autocovariance of function-valued traits. Writing out 
equation (1.1) for a single function-valued trait (at the ;th 
tip taxon, say), our model may be viewed as 

3 3 

/(*) = X><M*) + X>M-(*), (3-3) 

z=l i=l 

where the mixing coefficient Xq has been expressed as the 
sum of gij, the genetic (i.e. phylogenetic) part of variation, 
plus ey, the non-phylogenetic (e.g. environmental) part of 
variation, just as in these references. Then, the autocorrelation 



of the function-valued trait is 

E[f(x 1 )f(x 2 )} = + «) 2 )«M*i)<M*2). (3.4) 

1=1 

The estimates of o{ and obtained in §2.4 may be 
substituted into equation (3.4) to obtain an estimate of the 
autocovariance of the function-valued traits under study. 
This estimate has the attractions both of being positive defi- 
nite (by construction) and of taking phylogeny into account. 

Various frameworks exist that could be used to generalize 
the method presented in §2.4, to model heterogeneity of evol- 
utionary rates along the branches of a phylogeny [36] or for 
multiple fixed [15] or randomly evolving [16,37] local 
optima of the mixing coefficients. For the stationary OU pro- 
cess, the optimum trait value appears only in the mean, and 
not in the covariance function, and so does not play a role as 
a parameter in GPR [20]. We have not implemented such 
extensions here, effectively assuming that a single fixed opti- 
mum is adequate for each mixing coefficient. Nonetheless, 
our framework is readily extensible to include such effects, 
either implicitly through branch-length transformations [38], 
or explicitly by replacing the OU model with the more 
general Hansen model [37]. 

R code for the IPCA, ancestral reconstruction and hyper- 
parameter estimation is available from https://github. 
com/fpgpr/. 
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